CS 294-1 Homework 3
Timothy Hunter and Andre
Abstract
In this assignment, the goal was to parse a large set of Wikipedia articles, extract features into a sparse feature matrix, and cluster the articles with a clustering algorithm of our choice. One motivation for automated clustering of unstructured, unlabeled data is to detect correlations between data points; in the case of Wikipedia, for instance, one might be able to automatically group articles into a small set of broad categories such as “Politics”, “People”, etc. Although Wikipedia does provide category labels, they are not centrally controlled or restricted (any user can create a category), which results in a very large number of category labels. Also, there is no documentation page that lists the labels, and some online resources point out that Wikipedia’s current set of category labels by no means forms a tree, but rather a directed acyclic graph.
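As a rough illustration of the feature-extraction step described above, the following sketch builds a bag-of-words count matrix in compressed sparse row (CSR) form using only the Python standard library. The tokenizer, the function names `tokenize` and `build_csr`, and the toy documents are assumptions made for this example; the abstract does not say which features or which clustering algorithm were actually used.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_csr(docs):
    """Build a bag-of-words count matrix in CSR form.

    Returns (vocab, indptr, indices, data). The nonzero entries of row i
    are the columns indices[indptr[i]:indptr[i + 1]] with the counts
    data[indptr[i]:indptr[i + 1]].
    """
    vocab = {}                      # token -> column index
    indptr, indices, data = [0], [], []
    for text in docs:
        counts = Counter(tokenize(text))
        # Sort tokens so column assignment is deterministic.
        for token, count in sorted(counts.items()):
            col = vocab.setdefault(token, len(vocab))
            indices.append(col)
            data.append(count)
        indptr.append(len(indices))  # end of this document's row
    return vocab, indptr, indices, data


# Hypothetical toy "articles" standing in for parsed Wikipedia text.
docs = [
    "The senate passed the bill",
    "The president signed the bill",
    "The striker scored a goal",
]
vocab, indptr, indices, data = build_csr(docs)
```

Only the nonzero counts are stored, which matters when the vocabulary is far larger than any single article. A clustering step would then operate on these sparse rows.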
Similar resources
CS294-1 Homework 2: Linear Regression
The goal of this homework assignment was to use linear regression to predict users’ ratings of books sold on Amazon.com. To construct the feature matrix, we tokenized the written review texts in several different ways: with and without stemming, and using either single words or bigrams as tokens. The dependent variable is a star rating given by reviewers, ranging from 1 star (very unsati...
CS 294-1 Homework 1 Mobin Javed Collaborators
First, a naive implementation whose feature selection retains only a few features results in poor accuracy because (i) the stop words and non-words have not been filtered out and turn out to be the features with the highest probability, and (ii) with few features the dictionary might be biased towards one class. Table 1 shows the positive and negative dictionary formed by selecting the top ten features from each...
Publication date: 2012